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Abstract 

In some areas of measurement item parameters should not be modeled as fixed but 
as random. Examples of such areas are: item sampling, computerized item generation, 
measurement with substantial estimation error in the item parameter estimates, and 
grouping of items under a common stimulus or in a common context. A hierarchical 
version of the three-parameter normal-ogive model is used to model parameter variability 
in multiple populations of items. TWo Bayesian procedures for the estimation of the 
parameter are given. The first method produces an estimate of the posterior distribution 
using a Markov chain Monte Carlo method (Gibbs sampler), the second produces a Bayes 
modal estimate. It is shown that the procedure using the Gibbs sampler breaks down if 
for some of the random item parameters the sampling design yields only one response. 
However, in this case, marginalization over the item parameters does result in a feasible 
estimation procedure. Some numerical examples are given. 

Keywords: Bayesian estimates; Bayes modal estimates; Gibbs sampler; item 
generation; item grouping; item sampling; multilevel item response theory; marginal 
maximum likelihood; Markov chain Monte Carlo; sampling design. 
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Introduction 

Item response theory (IKT) models with random examinee parameters have become 
a common choice among practitioners in the field of educational measurement. Though 
initially the choice for such models was motivated by the attempt to get rid of the 
statistical problems inherent in the incidental nature of the examinee parameters (Bock 
& Lieberman, 1970), the insight soon emerged that such models more adequately 
represent cases where the focus is not on measurement of individual examinees but 
estimation of characteristics of populations. Early examples of models with random 
examinee parameters in the literature are given in Andersen and Madsen (1977) and 
Sanathanan and Blumenthal (1978), who were interested in estimates of the mean and 
variance in a population of examinees, and in Mislevy (1991), who provided the tools for 
inference from a response model with a regression structure on the examinee parameters 
introduced to account for sampling from populations of examinees with different values 
on background variables. 

In traditional large-scale testing, a statistical necessity to model item parameters in 
IRT models as random has hardly been felt. Typically, the values of the item parameters are 
first estimated from large samples of examinees, with the examinee parameter integrated 
out of the likelihood or posterior distribution. During operational testing, because the 
calibration sample was large, the item parameters are fixed to their estimates and treated as 
known constants rather than as incidental parameters with unknown values. Nevertheless, 
the measurement literature shows a recent interest in response models with random item 
parameters. The reason for this phenomenon is the insight that such models better 
represent the use of sampling designs that involve random selection of items or cases 
where sets of items can be considered as exchangeable once we know they belong to the 
same ’’group” or ’’class”. 

The most obvious case of measurement with random item characteristics arises is 
domain-referenced testing. In this type of testing, the idea of assembling a fixed test for 
all examinees is abandoned in favor of a random sample from a large pre-written pool 
of items for each examinee (e.g., Millman, 1973). The model originally used to guide 
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domain-referenced testing programs with dichotomously scored items was the binomial 
error model (Lord & Novick, 1968, chap. 23), given by 

Pr{X n = X | k,TTi} = Q^7T*(1 - 7T n ) fc_I , 

where X n is the number of successes for examinee n on a test of size k sampled from the 
domain and ir n is the examinee’s success parameter. Clearly, the success parameter in this 
model depends both on the examinee and the domain of test items. Attempts to decompose 
7 v n into separate components for the examinee and the items led to the introduction of IKT 
models with random item parameters. One of the first models of this kind is found in 
Albers, Does, Imbos and Jansen (1989), who needed an explicit examinee parameter to 
estimate progress of learning in a longitudinal study with tests sampled from the same 
pool of items at different time points. 

A more sophisticated application of the idea of item sampling has become available 
through the introduction of computer-generated items in educational measurement. Using 
an item-cloning technique (see, for instance, Bejar, 1993, or Roid & Haladyna, 1982), it 
is no longer necessary to write each item in the domain individually. Instead they can 
be generated by the computer from a smaller set of ’’parent items” through the use of 
transformation rules. One of the more popular types of computer generation of items 
is based on so-called “replacement set procedures” (Millman & Westman, 1989), where 
the computer is used to replace elements in the parent item (e.g., key terms, relations, 
numbers, and distractors) randomly from well-defined sets of alternatives. Because the 
substitution introduces (slight) random variation between items derived from the same 
parent, it becomes efficient to model the item parameters as random and shift the interest to 
the hyperparameters that describe the distributions of the item parameters within parents 
(Glas & van der Linden, 2001). Observe that this application is more general than the 
previous one because we now consider sampling from multiple populations of items in 
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The current trends towards increased testing in education and individualization 
of test administration have put stress on the resources for item calibration at testing 
organizations. As a consequence, it becomes attractive to find alternatives to the 
traditional large-sample approach to item calibration. A possible solution is to accept 
non-negligible estimation error in item parameter estimates and treat them as random 
in operational testing, e.g., using their posterior distribution when assembling a test 
or estimating examinee parameters. The first to deal with this problem in IRT were 
Tsutakawa and Johnson (1990; see also van der Linden & Pashley, 2000). The problem 
of how to deal with posterior distributions for the item parameters in an adaptive testing 
procedure has been addressed in Glas and van der Linden (2001). 

An omnipresent feature of mainstream IKT models is the assumption of conditional 
independence between the response variables given the examinee’s ability level. 
However, it has long been known that items that share a common element may loose 
this feature. Examples are sets of items with a common stem or items sharing a common 
context because the test is organized as a set of fixed testlets (Wainer & Kiely, 1987). To 
deal with this problem, Bradlow, Wainer and Wang (1999; see also Wainer, Bradlow & Du, 
2001 ) replaced the well-known parameter structure in two-parameter and three-parameter 
IKT models by 

O-i(0 n — bi — 7 nd(i))’ 

where 6 n , bi and a* are the traditional parameters for the ability of examinee n and the 
difficulty and discrimination power of item i. The new parameter 7 nd ^ was introduced 
to represents a random effect for the combination of examinee n and the nesting of 
item i in testlet d. Observe that this model actually is an (overparameterized) version 
of a multidimensional IRT model with decomposition 7 nd ^ = a i(i 9dn , where 0d n is 
the score for examinee n on an ability dimension unique to testlet d and a id is the 
discrimination parameter for item i on this dimension. Because testlets have a fixed 
structure, randomness of 7 nd ^ cannot come from sampling of the items. However, if 
the examinees are sampled, 9 dn becomes random, and so does 7 nd ^y 
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A final example of the use of a model with random item parameters is given in 
Janssen, Tuerlinckx, Meulders and de Boeck (2000). These authors are interested in the 
process of standard setting on a criterion-referenced test with sections of items in the 
test grouped under different criteria. Because of this grouping, the IKT model is chosen 
to have random item parameters with different distributions for different sections. At 
first sight, grouping of items does not necessarily seem to lead to a model with random 
parameters. However, a general approach to account for dependency due to common 
elements between units is to behave as if they were a stratified random sample from a 
set of subpopulations and model the process accordingly. A Bayesian argument in favor 
of this approach is that if the only thing known a priori about the items is that they are 
grouped under common criteria, they are exchangeable given the criterion and can be 
treated as if they are a random sample. 

It is the purpose of this article to give a Bayesian treatment of the problem 
of estimating the parameters in a model with random item parameters and multiple 
populations of items. The model does not only allow for all item properties that have 
traditionally been modeled using the three-parameter logistic model (item difficulty, 
discriminating power, and possibility to guess) but also for dependency between these 
features within populations (e.g., correlation between parameters for discriminating 
power and guessing). The treatment is fully Bayesian in the sense that (informative) 
priors are formulated for all hyperparameters describing the distributions of the item 
parameters within the populations. Two estimation procedures are presented. In the first 
procedure, the posterior distribution of all parameters are generated concurrently using 
a Markov chain Monte Carlo (MCMC) simulation algorithm (i.e., the Gibbs sampler). 
In the second procedure, Bayesian modal estimates for a subset of the parameters are 
computed marginalizing over the other parameters. Before presenting the procedures, 
a feature of the sampling design for collecting the response data critical for the choice 
between the parameter estimation procedures is discussed. 
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Sampling Design 

The sampling design governs sampling of items and examinees in the calibration 
study and thus controls how much response data we have for each possible realization 
of the random item and examinee parameters. A critical feature for item parameter 
estimation in the multilevel model below is the number of responses per realization of 
the random item parameters. If, as will become clear below, this number is equal to one 
for some of the items, a procedure for concurrently estimating the posterior distributions 
of all parameters in the model, breaks down in the sense that we have too little data, that is, 
no statistical information can be aggregated for the some of the parameters in the model. 
In the sequel, these item parameters will be called incidental item parameters. 

A practical illustration of the distinction between a sampling design where all the 
item parameters can be treated as structural and a sampling design where some of 
the item parameters are incidental is the case of computer-generated items discussed 
above. One possible implementation of computer-based item generation is to have the 
computer generate a new item for each examinee (’’item generation on the fly”). Another 
implementation is to generate a set of item clones prior to operational testing and sample 
from this set during testing. In the former case, all item parameters are incidental; in the 
latter case, some items will have incidental parameters if the set is large relative to the 
population of examinees tested and the design involves random assignment of items to 
examinees (as, for instance in adaptive testing). 

The distinction between structural and incidental parameters in statistical models 
has been introduced by Neyman and Scott (1948; also see, Kiefer & Wolfowitz, 1956). 
In an estimation problem with structural parameters, the number of parameters remains 
finite if the number of observations goes to infinity, whereas in a problem with incidental 
parameters the number of parameters goes to infinity. The presence of incidental 
parameters causes problems for statistical inferences, for instance, the solutions to 
the likelihood equations for the structural parameters may loose their consistency or 
asymptotic efficiency. 
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If each examinee gets a different item, the random item parameters are incidental 
parameters in the sense of Neyman and Scott. If the items are sampled from a finite 
set, their parameters are structural. However, the latter may still result in inestimable 
parameters in the Bayesian framework below. Nevertheless, one of the proven measures 
to solve problems with incidental parameters - marginalizing them out of the likelihood 
function - also works for the case in which some examinees get unique items. For such 
cases a marginal maximum likelihood approach is presented. This solution can be used 
as an alternative to the Bayesian framework in testing with computerized item generation 
if new items are generated on the fly for each examinee, or in any other application of 
response models with random item parameter with too few responses per item. 

We will return to this issue in the Discussion section to discuss other sampling 
designs that complicate parameter estimation in models with random item parameters. 
In fact, as already admitted in Newman and Scott (1948), more complex cases exist 
in which parameters appear in varying combinations of random variables. Educational 
measurement with random item parameters and incomplete sampling design clearly 
belongs to this category. 



The Model 

Consider a set of item populations p = 1, ..., P of size ki , ..., fcp, respectively. The 
items in population p will be labeled i p = 1, ...,k p . It proves convenient to introduce 
sampling design variables d nip , which assumes a value equal to one if person n responded 
to item i p and zero otherwise. Let X nip be the response variable for person n and item 
i p . If d nip = 1, X nip attains the value one for a correct response and a value zero for an 
incorrect response. If d nip = 0, X nip attains an arbitrary value r(r / 0;r/ 1). Notice 
that with this definition the design variables are completely determined by the response 
variables; they are only introduced to facilitate the mathematical presentation. 
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First-Level Model 

The first-level model is the three-parameter normal ogive (3PNO) model, which 
describes the probability of a correct response as 

Pi^nip = 1 | d n i p = 1, 9 n , di p , bi p , Cj p ) = d p + (1 dp)*&(Q,ip@n ^i p )> (^) 

where a ip ,b ip , and c* p are item parameters, 6 n is an examinee parameter, and $(.) is 
the normal cumulative distribution function. The parameterization of the models in (1) 
is slightly different from the usual parameterization for the logistic and normal-ogive 
models, a,i p {6 — bi p ). The only motivation for our choice is to simplify the presentation 
below. 

The reason for considering the 3PNO model rather that the 3PL model is that the 
former appears to be more tractable in an MCMC framework. However, as is well 
known, for an appropriately chosen scale factor both models are numerically nearly 
indistinguishable and either model is expected to fit only if the other does. 

Second-Level Model 

The values of the item parameters (aj p , bi p) d p ) in (1) are considered as realizations 
of a random vector. We will use the transformation 



£i p = K» h P ’ lo git Cip), ( 2 ) 

which gives the item parameters scales for which the following assumption of multivariate 
normality is reasonable: 



~ JV(/v S„). (3) 

where /n p is the vector with the mean values of the item parameters for population p and 
£ p their covariance matrix. Observe that the hyperparameters (pt p , E p ) are allowed to 
vary across the populations of items. 
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In the inferences below, we assume that 6 n has a standard normal distribution 

0 n ~N{ 0,1). (4) 

This assumption holds if examinee n is from a population of exchangeable examinees with 
a normal distribution of abilities. Examinees and items are thus distributed independently, 
that is, we do not assume that the items are sampled dependently on the examinee abilities. 

Prior for Hyperparameters 

A convenient choice for the prior distribution for the hyperparameters (^i p ,E p ) is 
a normal-in verse-Wishart distribution (see, for instance. Box & Tiao, 1973, or Gelman, 
Carlin, Steam & Hall, 1995). The prior follows from the specification 

E p ~ Inv — Wishart Vo (Y,Q ) 

I E p ~ MVN(n 0 , S p /«o) 

and has a density given by 

p(M p , S p ) oc | Eo |-«-+W2+i> ex p ^-lirfSoSp- 1 ) - - nS) , 

( 5 ) 

where S 0 and vq are the scale matrix and degrees of freedom for the prior on E p and 
fj-o and kq are the weight for the prior on fj, p , respectively. The weight expresses the 
information in the prior distribution as the number of prior measurements it can be equated 
to. 

It should be noted that, though the hyperparameters E p ) are allowed to take 
different values across populations, a common prior is specified for all hyperparameters. 
The function of the prior is only to bound their distribution to a likely region of possible 
values. 
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Likelihood Function 

The response vector of examinee n is denoted as = (x n i lt , ...,x nip , ..., x nip ). 
Using the assumptions of (1) independence between examinees, (2) independence 
between items and examinees, and (3) local independence within examinees, the 
likelihood function associated with response data x = (x n ) can be written as 

p(0>€>**>E;x,( d«)) = JJp(x„ | d n ,0 n ,£,/i,:S) 

n 

= nnn^n, I £t p )p(^n) 

n V ip 

im**.!*.*)- < 6 > 

P b 

The convention will be followed that p(x nip \ d nip = 0, 6 n , a ip , b ip , d p ) = 1. 

Discussion 

The current model for random items and multiple item populations differs from 
the multilevel IKT models for testlets in Bradlow, Wainer & Wang (1999) and Wainer, 
Bradlow, and Zu (2000) in that the latter only has a random interaction parameter between 
examinees and items but fixed parameters a,, 6,, and q. The statistical treatment of the 
models is the same, however; in these two papers an MCMC framework is used to estimate 
the parameters as well. The current model also differs from the one in Albers, Does, 
Imbos and Jansen (1989). These authors use a one-parameter version of the normal-ogive 
model, i.e., the model in (1) with a, = 1 and q = 0, but add a growth parameter for each 
examinee that is assumed to increase linearly over time. Finally, the model introduced in 
Janssen, Tberlinckx, Meulders and de Boeck (2000) is a two-parameter version of the one 
in (1) obtained by setting c» = 0. Their second-level model specifies independent normal 
distributions for a, and 6, and is thus a special case of (3) with S p reduced to a 2x2 identity 
matrix. These authors also treat parameter estimation in an MCMC framework, but with 
uninformative priors for (/r a) p b ) rather than the prior in (5). 
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Parameter Estimation 

Three methods for estimation of the parameters of the model will be discussed. The 
first two pertain to sampling schemes where the item parameters £ ip can be viewed as 
structural parameters, that is, as the sample size grows, their number remains limited; or, 
in other words, the sampling design is such that statistical information with respect to these 
parameters can be accumulated. The first method is a Bayesian method where the joint 
posterior distribution of all model parameters is evaluated using the Gibbs sampler. The 
second method is a Bayes modal estimation procedure that produces point estimates of the 
item parameters. From a Bayesian perspective, the latter method produces posterior mode 
estimates of the item parameters £ ip , n p and E p , where the posterior is marginalized over 
the incidental parameters 6. The third estimation procedure pertains the case where the 
item parameters £ ip are also incidental. The third procedure is a Bayes modal estimation 
procedure where the likelihood or the posterior is marginalized both with respect to £ ip 
and 9. 

Bayesian Estimation Using the Gibbs Sampler 

In Bayesian modeling, all parameters are considered as random variables. A modem 
approach to produce the posterior joint distribution of the parameters of interest is by 
simulation. A Markov chain Monte Carlo (MCMC) procedure will be used to sample this 
posterior distribution. The chains will be constructed using the Gibbs sampler (Gelfand 
& Smith, 1990). To implement the Gibbs sampler, the parameter vector is divided into 
a number of components, and the components are sampled consecutively from their 
conditional posterior distributions given the last sampled values for all other components. 
This sampling scheme is repeated until the distribution of sampled values forms a stable 
estimate of the posterior distributions. Albert (1992) applies Gibbs sampling to estimate 
the parameters of the 2PNO model. A generalization to the 3PNO model is given by 
Beguin and Glas (2001). A more general introduction to MCMC for ERT models is found 
in Patz and Junker (1999a), whereas applications for models with multiple raters, multiple 
item types and missing data are given in Patz and Junker (1999b), models with a multi- 
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level structure on the ability parameters in Fox and Glas (2001) and multidimensional 
models in Beguin and Glas (2001) and Shi and Lee (1998). 

Data Augmentation 

Beguin and Glas (2001) introduce a data augmentation scheme for the 3PNO based 
on the following interpretation. (In their implementation of the Gibbs sampler they 
choose a Beta prior for Cj p and a uniform prior on the positive real line for a ip , though.) 
Suppose that the examinee knows the correct answer with probability $(A„i p ), with 
A nip = a ip 9 n — b ip , and then gives a correct response with probability one or does not 
know the correct answer with probability 1 — $ ( Am p ) and then guesses the correct response 
with probability q p . The marginal probability of a correct response is equal to $(A n , p )+ 
Cjp(l - $(A nip )). Let 



So if W nip = 0, person i will guess the response to item j, and if W nip = 1, person i will 
know the right answer and will give a correct response. Consequently, the conditional 
probability of W nip = w nip given X nip = x nip is given by 



In addition to W n i p , following Albert (1992), the data are also augmented with latent data 
Z n i p , which are independent and normally distributed with mean A n i p = a ip 9 — b ip and 
standard deviation equal to one. The observed data X nip are considered as indicators of 
the sign of Z nip \ if X nip = 0 or 1, Z nip is negative or positive, respectively. 



{ 



1 if person i knows the correct answer to item j 
0 if person i doesn’t know the correct answer to item j. 



(7) 



P{w nip = 1 I X nip = 1, A nip ,Ci p ) oc $(A nip ) 

P(w nip = 0 I X nip = 1, A nip.Cip) OC Cj p (l - $(A nip )) 
P(W nip = l\X nip =0,\ nip ,c ip )=0 
P(Wni p =0\X nip =0,\ nip ,C ip ) = l. 



( 8 ) 
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Posterior Distribution 

The aim of the procedure is to simulate samples from the joint posterior distribution 
given by 



p(£,0,A*,E,z,w | x) oc p(z, w | x; £, 0)p(O)p(£ | p, E)p(#i, E|^ 0 , S 0 ). ^ 

The right-hand side probability (density) functions are given by (10) (see below), (4), (3) 
and (2), respectively. 

Steps in the Gibbs Sampler 

The steps of the Gibbs sampler are the following. 

Step 1 

The posterior p(z, w | x; £, 0) is factored as p(z | x; w, £, 0)p(w | x; £, 0). For the 
cases with d nip = 1, the values of w nip and z nip are drawn in following two substeps: 

a) w n i p is drawn from the conditional distribution of W nip given the data x and £, 
and 0, which is given in (8). 

b) z nip is drawn from the conditional distribution of Z nip given w, 0 and £, which is 
defined as 



Zni p I w,0,£,x~ 



N(\ n i p , 1) truncated at the left at 0, if w n i p = 1, 
N(\ nj p , 1) truncated at the right at 0, if w nip = 0. 



( 10 ) 



Step 2 

The value of 0 is drawn from the conditional posterior distribution of 0 given z and 
£. The distribution is derived as follows. From the definition of the latent variables Z nip 
it follows that Z nip + b ip = a ip d n + e nip , with e nip being a normally distributed residual. 
Because (a ip bi p ) is fixed, the equality defines a linear model for the regression of Z nip +b lp 
on a ip , with regression coefficient 9 n , which has a normal prior with parameters p = 0 





Variability in Item Parameters - 15 



and < 7 = 1 . Therefore, the posterior of 0 n is also normal. That is. 



o „ N (hdJi±hl°L 1 \ 

\ l/l/+ l/<7 2 ’ (l/l/+ l/<7 2 )/’ 



( 11 ) 



where 



e n = 



'y ^ 'y ^ d nip a ip ( z n i p + b ip ) 



p *p 



/ 



^ ’ S y^ j d nip aj 



p *p 



and 



v = l/ 



dnip a ip- 



p *p 



Step 3 

The vector of random item parameters £ ip is partitioned into <5 = (<5 ip ) = 

(aii,6i 1 ,...,a ip ,6 ip ,...)and c = (ci lt Ci p , ...). Hence, their conditional posterior 
density factors as p(£ ip |0, z ip , fi ip) E ip ) = p( logit Cj p |<5 ip) 0, z ip , n c]6 , 5D ct<5 ) 
p(S ip \0,z ip ,fx p ,'E p ), where pt c ^ and E^ are the expectation and variance of logit 
Cj p conditional on 6 ip . Then the following two substeps are made: 

a) The value of 6 ip is drawn from the conditional posterior distribution of the 
parameters of d given 0, z lp , pi p} and E p The distribution is derived as follows: Parameters 
6i p can be viewed as coefficients of the regression of z lp = (z nip ), on X = (0,-1), with 
—1 being a column vector with entries -1. So we have z lp = X<5 ip + e lp . Only examinees 
responding to item i p are considered here. Further, S lp has a normal prior with mean pi p 
and variance E p . Define S ip — (X £ X) _1 X £ z ip , define d = X £ X<5 ip + E p V i p and define 
D = (X £ X + Ep 1 ) -1 . Then a well-known result from Bayesian regression analysis (see, 
for instance. Box & Tiao, 1973) is that 

s i P I 0, z i p , x ,/V S P ~ N(Dd,D). 




( 12 ) 
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b) The value of c* p is sample from the conditional posterior distribution given 5i p , 0 , 
z i p ) /x c | 5 ,and S c |fi. Let U p be the number of persons who do not know the correct answer 
to item i p and guess the response. For the probability of a correct response of a person 
n on item i p given w nip = 0 it thus holds that P(Y nip = 1 | W nip = 0) = c^. The 
number of correct responses obtained by guessing, S ip , say, has a binomial distribution 
with parameters c* p and t ip . Since logit c* p has a normal prior with parameters pb c \ s and 
£ c |$, the procedure for sampling in a generalized linear model with a logit-link and a 
normal prior (see, Gelman Carlin, Steam & Hall, 1995, sects 9.9 and 10.6) can be used. 

Step 4 

Values for (p p , E p ) are drawn from the conditional posterior distribution given £, 
0, z, and x. The number of items sampled from population p is equal k p . The prior 
distribution in (5) is the conjugate for (p p , E p ). Hence, the posterior distribution is also 
normal-inverse- Wishart, with parameters 



V v 



«o kp ~z 

~Vo + 

Kr) K*) 



(13) 



v p 2 p = u 0 2 p +S + 

Kr, 



(14) 



fc _ _ _ 

where k p = k 0 + k p , u p = u 0 + k p , S = J](^ p - £ P )(£ ip - t P Y and £ p = 

ip 

corresponding posterior distribution is thus given by 



E«v The 



1 P 



E p | 4 P ~ Inverse-Wishartfc-i(S _1 ), 
V P I Sp^ p ~ N{l p , Ep/fc). 



(15) 



The procedure thus amounts to iterative generation of parameter values using the 
above four steps. Multiple MCMC chains can be started from different points to 
evaluate convergence by comparing the between- and within-sequence variance. Another 
approach is to generate a single MCMC chain and to evaluate convergence by dividing 
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the chain into subchains and comparing between- and within-subchain variance. For these 
and other technical details, see Gelman, Carlin, Steam and Hall (1995). 

Necessary Condition on Sampling Design 

As already discussed, the procedure breaks down if the examinees are administered 
unique items. This point can now be illustrated using the steps in the above Gibbs 
sampler. For example, Step 3a is based on a normal linear model z ip = X<5 Jp + e ip . 
However, if random item i p is administered to one examinee, z Jp has only one entry, 
and it is not possible to estimate two regression coefficients from one observation. 
Likewise, the generalized linear model in Step 3b is based on one observation, and here 
the estimation procedure also breaks down. A solution to this problem is marginalization 
of the likelihood function over the random item parameters. The remaining structural 
parameters are the hyperparameters E p ). The estimation equations for these 
parameters based on the marginal likelihood are given in the next section (cf. van der 
Linden & Glas, 2001). 



Bayes Modal Estimation 
All Item Parameters Structural 

In Bayes modal estimation (Bock & Aitkin, 1982), a distinction in made 
between structural and nuisance parameters. If the number of item parameters is 
limited, that is, this number does not depend on the number of respondents, the 
item parameters can be viewed as structural parameters and the ability parameters 
are the nuisance parameters. The structural parameters are stacked in a vector 
V — (£ij) •••) £i p) •••, Md Si, ..., ii p , E p , ..., Up, Ep). The structural parameters are 
estimated from a log-likelihood marginalized with respect to the nuisance parameters. 
That is, the so-called complete data likelihood given by (6) is integrated over the nuisance 
parameters. As a result, the marginal probability of observing response pattern x„ is given 
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by 



P( Xn | d n , 




dnipi 9 n , £i p )p(9n)d9 n , 



and the marginal log-likelihood function of rj is given by 



logL( rj ;x) 



EE 



p n 



log Pi Xn | £p) ^ ^ logp(£j„ I dni p ) P'pt S p ) 



+ logp(Mp,Sp|At 0 ,So) 



(16) 

(17) 



As above, the convention is used that p(x nip \ dn ip = 0 A>£i p ) = 1. The marginal 
likelihood equations for rj can be easily derived using Fisher’s identity (Efron, 1977; 
Louis 1982; see also Glas, 1992, 1998). The first-order derivatives with respect to rj 
can be written as 

d d 

— log L( T7 ; x ) = fpA r ),0n\x n ) |x n ,»7) = 0, (18) 

p n ' 

where £ n log f P] n{ , nfi TL ) is the complete data log-likelihood, that is, 

£ P !og f P ,n(vA \x n ) = 

£p Era |p( I dji> £pi 9 n ) + logp(^n) + Ei p log p(£ Ip | d nip , E p ) 

+ logp(At p ,Ep|M 0 ) E 0 ). 

Notice that the first-order derivatives in (18) are expectations with respect to the 
conditional posterior density of the nuisance parameters. 

Let P nip and $ mp be defined by P nip = p(x nip \ d nip = 1 ,9 n ,£ ip ) = a p + (1 - 
% )®m p , so $m p is the normal-ogive part of the probability P nip . By taking first order 
derivatives of the logarithm of this expression, likelihood equations for the parameters 




r "> 

c 
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£ ipU , u= 1, 3, are found as 



E 4 

n|d nip = l 



( %ni p Pnip )*&ni p {@n ^i p ) 



nip 



Xn,r? + (flip - /i pl ) = o, (19) 



E < 

n\d nip =l ' 



( Pnip %ni p )*&ni p Q'ip 



n,v) +{bi p ~ V p2 ) =0, (20) 



and 



E * 

^A^nip — 1 



{ ( -t-nip Pnip )(1 ^m p ) c i p (l Qp) 

\ Prii p ( 1 — Pnip) 



Xn ,V 



+ (logitCj p - fji p 3 ) = 0 . ( 21 ) 



These expressions are a straightforward generalization of the usual likelihood 
equations for the 3PNO; for details, refer to Glas (2000). It is easily verified that the 
likelihood equation for the parameters of the parent items are given by (13) and (14). 
The likelihood equations be solved using an EM or Newton-Raphson algorithm. Since 
the number of parameters in practical applications will be quite large, the latter algorithm 
will seldom be feasible. Expressions for confidence intervals can also be derived using 
Fisher’s identity (Louis 1982; Mislevy, 1986, Glas, 1998). However, the computation 
of the asymptotic covariance matrix of the estimates also involves the inversion of a 
matrix of second-order derivatives (information matrix). In the application presented 
below, only the information matrix within the item populations will be inverted, that is, 
the covariance between the populations will be assumed zero. This approximation results 
in confidence intervals that are larger than the confidence intervals obtained when the 
complete information matrix would be inverted. 

Incidental Item Parameters 

In the case every person is administered a unique item, say in a procedure where 
items are generated on the fly, the situation is different in the sense that the random item 
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parameters are unique for every person. This will be made explicit by adding an index n 
and writing £ ipJ1 . Now the number of item parameters £ fpn grows with the sample size, 
so it is doubtable whether they can be consistently estimated, and, therefore, they must 
be viewed as nuisance parameters, together with the ability parameters (Neyman & Scott, 
1948; Kiefer & Wolfowitz, 1956). These nuisance parameters are stacked in vectors 0 and 
£, respectively. This leaves rj = ..., n p , S p , ..., n P , £ P ) as structural parameters. 

The marginal probability of observing response pattern x„ is now given by 



p{ v) — I ••• I | j[ P( x ni p | dntp, 9 n , £i p „)p(£i p 7 J/ip, ^p)p{@n)d£i pn dQ n 

J P,ip 

= J 1 1 J ••• J p{ x nip \ dnipi^n, £i pn )p(£i p n\d n i p , tl'p,^Jp)d£ ifn 

P t lp 



p(9 n )d6 n . 



Notice that (8) entails a multiple integral over £ ipn . It now follows that the likelihood 
equations are given by 

= i < 22 > 



< 23 > 

n 

and 

&puv = ^ ' E{€pu.€pu I *111 v) ~~ P'puP'pu, ( 24 ) 

n 

where indices u and v ^ u denote the uth and wth element in the parameter vectors. 
Again, these equations can be solved using an EM or Newton-Raphson algorithm. 

Discussion 

A final remark concerns a case where each generated item is administered to more 
than one person, but the number of generated items still grows with the sample size. In 



£ 2 / 



o o 
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that case, the random item parameters must still be considered as nuisance parameters. 
Consider the case where each random item is given to two respondents, say n and m. 
The responses of both respondents now depend on the same random item parameter; this 
dependency will be made explicit by labeling this item parameter as £ ipnm - The complete- 
data likelihood can now be written as 



Notice that the integral does not factor further. In fact, as the number of respondents 
receiving the same random item goes up, we are quickly left with a multiple integral that 
cannot be computed by the usual Gauss-Hermite procedure (see, for instance, Glas, 1992). 
Fortunately, the fully Bayesian procedure discussed above does not have these problems. 



A number of studies were conducted to assess the feasibility of the procedures in 
practical situations. In some practical situations, the number of responses per population 
of items might be quite low and the number of item parameters might be quite high. In 
such cases, the convergence of the MCMC- or the EM-algorithm to reasonable parameter 
estimates is not a priori obvious. On the other hand, in a Bayesian framework the 
computation of estimates can be supported by a sensible choice of priors. 



p(x | 0,£,/4,£) = n im*** i ^ nip > $i p nm) 



(n,m) p i p 



p{Xmi p | d rn i p , 9 m , Ci p nm)p(^i p nm I 9, m i p , fJ>p,^Jp)p(9 n )p(9 rn ) , 



where the product is over pairs of respondents, and marginalized results in 




Some Numerical Examples 
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The study consisted of two stages. In the first stage, two real data sets were analyzed 
to obtain some idea of the covariance between the item parameters. Then, in the second 
stage, the estimates obtained in the first stage were used in a number of simulation studies 
aimed at assessing the quality of parameter recovery. 

The first data set consisted of the responses of 429 students to 10 multiple choice 
items in a computer based test for a course on naval architecture at the Ngee Ann 
Polytechnic in Singapore. The data were collected in 1999 and 2000. The numerical 
information in the item stem and the response alternatives was randomly changed every 
time an item was administered. The second data set consisted of the responses of a sample 
of 4000 students from the population participating in the 1991 central examination on 
French language comprehension in Secondary Education in the Netherlands. In this case, 
the test was a traditional paper-and-pencil test. Students were clustered in 116 schools, 
and it was assumed that the item parameters varied across classes. It was expected that the 
item-parameter variance might be high in the first example and low in the second example. 
In the first example, Bayes modal estimates of fj, and £ were obtained by marginalizing 
over all incidental parameters £ and 9. In the second example, two procedures were used. 
In the first procedure, concurrent estimates of £, 9,fi and £ were obtained using the 
MCMC method run with 13,000 iterations, 3000 of which were bum-in iterations. Below, 
expected a posteriori (EAP) estimates are reported as point estimates. In the second 
procedure, Bayes modal (MAP) estimates of £, fi and £ were obtained by marginalizing 
over 9. Computations were carried out using the EM-algorithm. 

In both examples, the same prior covariance matrix £o was used. The values in £ 0 
are shown in Table 1; they are no more than an educated guess. For instance, the negative 
covariance between the discrimination parameter a,i p and the logit-guessing-parameter 
logit Cj p is based on the consideration that, to obtain similar the item characteristic curves 
(ICCs), the discrimination parameter must go down if the lower asymptote goes up. In the 
same manner, when the respondents are relatively proficient, lowering the item difficulty 
parameter can be counterbalanced by lowering the discrimination parameter. This feature 
accounts for the choice of a positive prior covariance between the two parameters. The 
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prior for the parent item parameters was chosen equal to /x 0 = (1.0, 0.0,logit(0.25)). To 
obtain convergence in the analysis of the language comprehension data, it turned out that 
the parameters in the normal-in verse- Wishart prior for (/x p , £ p ) had to be set equal to 
v 0 = 10 and «o = 10, respectively. Since k p = 160, this choice results in a slightly 
informative prior. An uninformative prior sufficed for the Naval Architecture data. 

The averages of the point estimates of the covariance matrices are shown in Table 1 
(first three columns), together with their confidence intervals (last three columns). It 
can be seen that both the posterior variance of the item discrimination and difficulty 
parameters was generally lower than expected. For the EAP estimates, the posterior 
standard deviation is reported; for the MAP estimates, the values computed using the 
normal approximation are shown. It can be seen that the estimated variances are lower 
than the prior variances. Further, the standard errors of the MAP estimates are smaller 
than those of the EAP estimates. This effect is consistent with the findings of Glas, Wainer 
and Bradlow (2000). They argue that posterior distributions of bounded parameters, such 
as a variance or a discrimination parameter, are skewed. The standard error of the MAP 
estimate used here is based on an assumption of asymptotic normality, which, in turn, is 
based on a Taylor-expansion of the likelihood which terms of order higher two ignored. 
The fact that here only the within-item-population information matrices were used to 
obtain the standard errors did not nullify the effect. 

[Table 1 about here] 

The second part of this study was aimed at assessing the quality of parameter 
recovery. Since the difference in the covariances obtained for the two examples given 
above was not dramatically different, it was decided to study two conditions in two 
simulation studies. In the first simulation study, the prior parameters fi 0 and S 0 were 
the same as in the examples presented above. The parent item parameters, fi p were 
drawn from a normal distribution indexed by /z 0 and £ 0 > and £ p was set equal to £ p . 
Then, for each population, 10 items were randomly drawn from a normal distribution with 
parameters ii p and £ p . To produce realistic data, parent and random item discrimination 
parameters drawn below 0.5 were truncated to 0.5. The responses to the random items 
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were generated for simulees with an ability parameter randomly drawn from a standard 
normal distribution. Every simulee responded to 20 random items from 20 different 
populations. So the total data matrix consisted of 1,000 responses. As above, 13,000 
iterations were made, including 3,000 bum-in iterations. To obtain conveigence, the 
parameters in the normal-inverse-Wishart prior had to be set equal to uq = 2 and kq = 2, 
respectively. Since k p = 10, this choice entails a quite informative prior. 

The second simulation study had a similar set-up. The average of the EAP estimates 
of the mean and covariance matrix obtained using the French language examination was 
used as n Q and So- Further, the number of item populations was equal to 40, the number 
of random items per population was qual to 20, and the number of responses to each 
random item was 200. So in this case, the total number of responses was equal to 4,000. 



Some results of the two simulations are presented Table 2. The results are averaged 
over 10 replications and all items. The rows labeled a, b and logit c relate to random 
item parameters; all other rows relate to item-population parameters. The two columns 
labeled EAP relate to EAP estimates obtained using the Gibbs sampler, the columns 
labeled MAP relate to Bayes modal estimates from a posterior marginalized over 9. The 
columns labeled MAE give the mean absolute error of the estimates, averaged over items 
and replications. The columns labeled SE give the posterior standard deviation and the 
normal approximation for the EAP and the MAP estimates, respectively, again averaged 
over items and replications. It can be seen that the magnitudes of these estimates are 
clearly smaller than the corresponding MAEs. Further, especially in the case P = 40, 
the estimates of the covariance matrices seemed much more precise than the estimates 
of the item parameters. This result, however, is explained by the fact that the covariance 
matrices were not varied over populations, but chosen equal to their prior values. Further 
inspection of the results shows that the MAEs of the MAP estimates were somewhat 
smaller than the corresponding EAP estimates. 



[Table 2 about here] 
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[Figure 1 about here] 

Figure 1 shows the posterior distributions of a typical set of parameters for a run 
with P = 20. The three pictures in the first row are the posterior distributions of the 
three elements of /i p for a typical item-population p. The three pictures in the next row 
show the posterior distributions of the three parameters of an arbitrarily chosen random 
item i p . The last two rows give the posterior distributions of the elements of £ p , for the 
same item-population p. The dotted line in the pictures are the asymptotic distributions 
computed using the normal approximation described above. It can be seen that the latter 
approximations are not always realistic. The normal approximation of the variance of 
logit Ci p , for instance, gives discernible larger positive weight to negative values. The 
actual posterior distributions of several elements of E p are notably skewed to the right. 
Figure 2 shows the convergence of the Gibbs sampler for the same 12 parameters. The 
plot is based on the 2,000 draws taken equally spaced from the 10,000 draws following 
the bum-in iterations. From inspection of the plots it can be concluded that the chain 
has properly converged. In practice, visual inspection of the convergence plots of all 
parameters is not very practical. However, convergence can also be evaluated by dividing 
the generated chain into batches and comparing the within and between batch variance of 
the generated values. 



[Figure 2 about here] 

Finally, the figures 3 and 4 give a scatter plot of the generating values (x-axis) and the 
EAP-estimates (y-axis) of the population and random item parameters for two replications 
of both simulation studies. The truncation of the discrimination parameters at 0.5 is 
caused by the generation strategy described above. It can be seen from the plots that 
the relation between the generated and recovered parameters is quite good; in fact, all 
four correlations were above 0.80. Similar plots could not be made for logit c* p and its 
mean and the elements of the covariance matrices, because the variance in the generating 
values was too low and zero, respectively. 
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Discussion 

Several authors (Bradlow, Wainer & Wang, 1999; Janssen, Tberlinckx, Meulders & de 
Boeck, 2000; Wainer, Bradlow & Du, 2001) have proposed IKT models with random item 
parameters. These models, however, do not include within-item covariance of random 
item parameters. In this article, such a model was proposed, and Bayesian estimation 
methods for such models were outlined. It was shown that the sampling design is a crucial 
factor here. If every random item is responded to by a substantial number of respondents, 
Bayesian methods using the Gibbs sampler or marginalization over the ability parameters 
can be used. If only one response is given to every random item, these approaches break 
down. However, in that case, and only then, a Bayes modal estimation procedure using 
a posterior distribution marginalized with respect to ability parameters and the random 
item parameters can be used to estimate the means and covariance matrices of the item 
population parameters. 

A rule of thump for the minimum number of respondents that should respond to a 
random item in the first case was not sought here. However, already with 10 and 20 
random items per parent and 100 and 200 responses to every random item, the prior on the 
covariance matrix had to be informative. Situations with fewer random item parameters 
and observations per random item parameter might be modeled by assuming that all item 
parents have the same covariance matrix, but this suggestion remains a point of further 
study. 
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Table 1 

Prior and posterior values 

item covariance matrix 

Prior Covariance Matrix 



.200 

.100 1.000 
-.050 .050 .100 



French Language Comprehension 
EAP Estimate SE 


.102 


.017 


.031 .208 


.017 .033 


-.018 .010 .116 


.018 .020 .039 



French Language Comprehension 
MAP Estimate SE 



.098 


.014 






.029 .199 


.012 


.025 




-.018 .006 .107 


.015 


.016 


.037 


Naval Architecture 








EAP Estimate 


SE 






.120 


.032 






.027 .122 


.030 


.051 




.001 .002 .110 


.022 


.023 


.073 




32 



Variability in Item Parameters -31 



Table 2 



Parameter Recovery 



p 


ftp 


Tip 


Parameter 


True 


EAP 

MAE SE 


MAP 

MAE SE 


20 


10 


100 


a 


1.000 


0.404 


0.334 


0.407 


0.330 








b 


0.000 


0.514 


0.346 


0.425 


0.214 








logit c 


-1.099 


0.327 


0.654 


0.322 


0.639 










1.000 


0.311 


0.199 


0.283 


0.107 








H b 


0.000 


0.494 


0.307 


0.368 


0.178 








ftlogit c 


-1.099 


0.214 


0.414 


0.188 


0.124 










0.200 


0.076 


0.235 


0.091 


0.059 










1.000 


0.289 


0.684 


0.080 


0.020 








2 

® logit c 


0.100 


0.377 


0.550 


0.477 


0.108 










0.100 


0.089 


0.289 


0.067 


0.014 








& a y logit c 


-0.050 


0.046 


0.227 


0.051 


0.034 








&b, logit c 


0.050 


0.071 


0.470 


0.091 


0.055 


40 


20 


200 


a 


0.950 


0.392 


0.204 


0.424 


0.203 








b 


0.190 


0.365 


0.192 


0.327 


0.141 








logit c 


-0.979 


0.306 


0.294 


0.252 


0.333 








Ma 


0.960 


0.298 


0.095 


0.261 


0.062 








Vb 


0.180 


0.318 


0.124 


0.200 


0.076 








ftlogit c 


-1.002 


0.199 


0.130 


0.163 


0.104 
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Figure Captions 



Figure 1. Posterior densities and normal approximations 
Figure 2. Convergence of the Gibbs samper 

Figure 3. Generating values and parameter estimates K = 20 ,k ip =10 
Figure 4. Generating values and parameter estimates K = 40, k ip = 20 
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